In [1]:
# Style Similarity
In [2]:
# Import libraries
import numpy as np
import pandas as pd
# Import the data
import WTBLoad
wtb = WTBLoad.load()
Question: I want to know how similar two styles are. I really like Apricot Blondes, and I want to see which other styles apricot would work in. Perhaps it would be good in a German Pils.
How to get there: The dataset shows the percentage of votes that said a style-addition combo would likely taste good. So, we can compare the votes on each addition for any two styles, and see how similar they are.
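To make the comparison concrete, here is a minimal sketch of the idea on made-up numbers. The style names, additions, and vote shares below are hypothetical and are not taken from the dataset; the helper name `mean_squared_difference` is also just an illustration.

```python
import pandas as pd

# Hypothetical vote shares (fraction of voters who said the combo would taste good).
# Rows are additions, columns are styles. These numbers are invented.
votes = pd.DataFrame(
    {
        "Style X": [0.80, 0.10, 0.50],
        "Style Y": [0.70, 0.20, 0.50],
        "Style Z": [0.10, 0.90, 0.20],
    },
    index=["apricot", "coffee", "honey"],
)

def mean_squared_difference(frame, style_a, style_b):
    """Average squared gap between two styles' vote shares; lower means more alike."""
    return ((frame[style_a] - frame[style_b]) ** 2).mean()

close = mean_squared_difference(votes, "Style X", "Style Y")
far = mean_squared_difference(votes, "Style X", "Style Z")
print(close, far)
```

Style X and Style Y get a small score because voters rated the same additions similarly for both, while Style X and Style Z get a large one.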
In [3]:
# Mean squared difference between two styles' vote shares across all additions.
# Despite the name, this is a distance: it is higher when the styles differ
# and lower when they are alike.
def similarity(styleA, styleB):
    diff = np.square(wtb[styleA] - wtb[styleB])
    return diff.mean()
res = []
# Transpose so that each style is a column. Run this cell only once:
# re-running it would transpose the data back.
wtb = wtb.T
# Loop through each pair of styles
for styleA in wtb.columns:
    for styleB in wtb.columns:
        # Keep only pairs where A sorts before B alphabetically. This skips
        # identical styles and records each pair once.
        if styleA < styleB:
            res.append([styleA, styleB, similarity(styleA, styleB)])
df = pd.DataFrame(res, columns=["styleA", "styleB", "similarity"])
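As an aside, the "each unordered pair once" logic could also be written with `itertools.combinations`, which avoids the explicit alphabetical check. This is only a sketch on a toy frame (the `toy` data and the `pairs` name are made up for illustration), not a replacement for the cell above.

```python
from itertools import combinations

import pandas as pd

# Toy stand-in for the transposed data: columns are styles, rows are additions.
toy = pd.DataFrame(
    {"B": [1.0, 0.0], "A": [0.0, 0.0], "C": [1.0, 1.0]},
    index=["addition 1", "addition 2"],
)

# combinations() over the sorted names yields each pair once, with the
# alphabetically earlier style first.
rows = [
    (a, b, ((toy[a] - toy[b]) ** 2).mean())
    for a, b in combinations(sorted(toy.columns), 2)
]
pairs = pd.DataFrame(rows, columns=["styleA", "styleB", "similarity"])
print(pairs)
```

For n styles this produces n(n-1)/2 rows, the same set of pairs the nested loop keeps.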
In [4]:
df.sort_values("similarity").head(10)
Out[4]:
In [5]:
df.sort_values("similarity", ascending=False).head(10)
Out[5]:
In [6]:
def comboSimilarity(styleA, styleB):
    # Pairs are stored with styleA before styleB alphabetically, so swap if needed
    if styleA > styleB:
        styleA, styleB = styleB, styleA
    return df.loc[(df['styleA'] == styleA) & (df['styleB'] == styleB)]
comboSimilarity('Blonde Ale', 'German Pils')
Out[6]:
But is that good or bad? How does it compare to others?
In [7]:
df.describe()
Out[7]:
We can see that the score for Blonde Ale and German Pils falls between the mean and the median (50th percentile), so the pair is about average: not an especially close match, but not a poor one either.
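Another way to answer "how does it compare to others?" is to compute what share of all pair scores fall below a given score. The sketch below uses invented scores, and `percentile_rank` is a hypothetical helper, not something defined earlier in this notebook.

```python
import numpy as np

def percentile_rank(scores, value):
    """Percentage of scores that are strictly below `value`."""
    scores = np.asarray(scores, dtype=float)
    return 100.0 * np.mean(scores < value)

# Invented scores standing in for the 'similarity' column.
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
rank = percentile_rank(toy_scores, 0.45)
print(rank)
```

Because a lower score means the styles are more alike, a low percentile rank would indicate a closer-than-usual match.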
We can also take a look at this visually to confirm.
In [8]:
%matplotlib inline
import matplotlib.pyplot as plt
n, bins, patches = plt.hist(df['similarity'], bins=50)
# Use a separate name so the similarity() function is not overwritten
score = comboSimilarity('Blonde Ale', 'German Pils')['similarity'].iloc[0]
# Find the histogram bin that holds the score. The first edge greater than
# the score is that bin's right edge, so the bin itself is one index earlier.
target = np.argmax(bins > score) - 1
patches[target].set_fc('r')
plt.show()
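The bin lookup is easy to get off by one, because a histogram with 50 bins has 51 edges. Below is a small sketch of one way to map a value to its bin index using `np.searchsorted`. It uses `np.histogram` on invented values so it runs without a plot, and `bin_index` is a hypothetical helper name.

```python
import numpy as np

# Invented values standing in for the 'similarity' column.
values = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
counts, edges = np.histogram(values, bins=5)

def bin_index(edges, x):
    """Index of the bin containing x, given the bin edges (len(edges) - 1 bins)."""
    # Number of edges <= x, minus one, is the bin whose left edge is at or below x.
    idx = np.searchsorted(edges, x, side="right") - 1
    # The last bin is closed on the right, so clamp the maximum value into it.
    return int(np.clip(idx, 0, len(edges) - 2))

print(bin_index(edges, 0.5))
```

The returned index can be used directly to pick the matching entry from the `patches` list that `plt.hist` returns.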